Data cleaning

Quantitative Methodology (UPF)

Jordi Mas Elias

https://www.jordimas.cat/

Summary

  • Introduction
  • Tidy data
  • Missing data
  • Join data
  • Codebooks

Warm up

Warm up

Introduction

Introduction

“Data Scientists spend up to 80% of the time on data cleaning and 20% on actual data analysis”.

Source: R for Data Science

Introduction

A df is tidy if it fulfills these requirements (Wickham 2014):

  • Each df has one unit of observation.
  • Observations represented in the rows.
  • Variables represented in the columns.
  • Each cell indicates a value.

Source: R for Data Science

Introduction

Why do we need to clean data?

  • Data is not tidy.
  • Data is in many datasets.
  • Data is missing.

Introduction

Load packages.

library(dplyr)
library(readr)
library(tidyr)
library(countrycode)

Tidy data

Untidy dataframes?

Number of TB cases documented by the WHO in Afghanistan, Brazil, and China between 1999 and 2000 (cases & population).

Table A

# A tibble: 6 × 3
  country      year rate             
* <chr>       <int> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

Table B

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <int>  <int>      <int>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Table C

# A tibble: 12 × 4
   country      year type            count
   <chr>       <int> <chr>           <int>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

Table D

# A tibble: 3 × 3
  country     `1999` `2000`
* <chr>        <int>  <int>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

A tidy dataframe

NOO ANOTHER

Correlates of War Intra-State War Data

# A tibble: 442 × 4
   WarName            WarType SideA          SideB                           
   <chr>                <dbl> <chr>          <chr>                           
 1 First Caucasus           5 Russia         Georgians, Dhagestania, Chechens
 2 Sidon-Damascus           6 Sidon          Damascus & Aleppo               
 3 First Two Sicilies       4 Austria        -8                              
 4 First Two Sicilies       4 Two Sicilies   Liberals                        
 5 Spanish Royalists        4 Spain          Royalists                       
 6 Sardinian Revolt         4 Austria        -8                              
 7 Sardinian Revolt         4 Sardinia       Carbonari                       
 8 Greek Independence       5 Ottoman Empire Greeks                          
 9 Greek Independence       5 -8             United Kingdom                  
10 Greek Independence       5 -8             France                          
# … with 432 more rows

An untidy dataframe

PEW Research Religious Landscape Study1

# A tibble: 18 × 6
   religion                `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k`
   <chr>                     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 Agnostic                     27        34        60        81        76
 2 Atheist                      12        27        37        52        35
 3 Buddhist                     27        21        30        34        33
 4 Catholic                    418       617       732       670       638
 5 Don’t know/refused           15        14        15        11        10
 6 Evangelical Prot            575       869      1064       982       881
 7 Hindu                         1         9         7         9        11
 8 Historically Black Prot     228       244       236       238       197
 9 Jehovah's Witness            20        27        24        24        21
10 Jewish                       19        19        25        25        30
11 Mainline Prot               289       495       619       655       651
12 Mormon                       29        40        48        51        56
13 Muslim                        6         7         9        10         9
14 Orthodox                     13        17        23        32        32
15 Other Christian               9         7        11        13        13
16 Other Faiths                 20        33        40        46        49
17 Other World Religions         5         2         3         4         2
18 Unaffiliated                217       299       374       365       341

Pivoting data

We change the rows and columns of the dataframe keeping the same information.

  • Pivot longer
pivot_longer(df, cols, names_to, names_to)
  • Pivot wider
pivot_wider(df, names_from, values_from)
  • Separate
separate(df, col, into, sep)
  • Unite
unite(df, col, ..., sep)

Join data

Join data

Classifications

  • States: Countrycode, ISO3c, Eurostat.
  • Regions: UN.
  • Intergovernmental Organizations: COW
  • Political parties: Party facts
  • Municipalities: IDESCAT, INE
  • Census sections: IDESCAT

Missing data

Why they are missing?

NA (Not Available)

Related with the state capacity to collect them (Stone2008?)

Check the World Bank

Small or underdeveloped states

Codebooks

Bibliography

Wickham, Hadley. 2014. Tidy Data.” Journal of Statistical Software 50 (10): 1–23.